Language Classification and Segmentation of Noisy Documents in Hebrew Scripts
نویسندگان
چکیده
Language classification is a preliminary step for most natural-language related processes. The significant quantity of multilingual documents poses a problem for traditional language-classification schemes and requires segmentation of the document to monolingual sections. This phenomenon is characteristic of classical and medieval Jewish literature, which frequently mixes Hebrew, Aramaic, Judeo-Arabic and other Hebrew-script languages. We propose a method for classification and segmentation of multi-lingual texts in the Hebrew character set, using bigram statistics. For texts, such as the manuscripts found in the Cairo Genizah, we are also forced to deal with a significant level of noise in OCR-processed text.
منابع مشابه
Document Analysis And Classification Based On Passing Window
In this paper we present Document analysis and classification system to segment and classify contents of Arabic document images. This system includes preprocessing, document segmentation, feature extraction and document classification. A document image is enhanced in the preprocessing by removing noise, binarization, and detecting and correcting image skew. In document segmentation, an algorith...
متن کاملLearning Segmentation of Documents with Complex Scripts
Most of the state-of-the-art segmentation algorithms are designed to handle complex document layouts and backgrounds, while assuming a simple script structure such as in Roman script. They perform poorly when used with Indian languages, where the components are not strictly collinear. In this paper, we propose a document segmentation algorithm that can handle the complexity of Indian scripts in...
متن کاملOCR for printed Kannada text to Machine editable format using Database approach
This paper describes an Optical Character Recognition (OCR) system for printed text documents in Kannada, a South Indian language. The proposed OCR system for the recognition of printed Kannada text, which can handle all types of Kannada characters. The system first extracts image of Kannada scripts, then from the image to line segmentation then segments the words into sub-character level piece...
متن کاملMinimizing Loss of Information at Competitive PLIP Algorithms for Image Segmentation with Noisy Back Ground
In this paper, two training systems for selecting PLIP parameters have been demonstrated. The first compares the MSE of a high precision result to that of a lower precision approximation in order to minimize loss of information. The second uses EMEE scores to maximize visual appeal and further reduce information loss. It was shown that, in the general case of basic addition, subtraction, or mul...
متن کاملPersian/Arabic Document Segmentation Based On Pyramidal Image Structure
Automatic transformation of paper documents into electronic documents requires document segmentation at the first stage. However, some parameters restrictions such as variations in character font sizes, different text line spacing, and also not uniform document layout structures altogether have made it difficult to design a general-purpose document layout analysis algorithm for many years. Thus...
متن کامل